ABSTRACT


Interactive simulations allow students to independently ex- 
plore scientific phenomena and ideally infer the underlying 
principles through their exploration. Effectively using such 
environments is challenging for many students and there- 
fore, adaptive guidance has the potential to improve stu- 
dent learning. Providing effective support is, however, also a 
challenge because it is not clear what effective inquiry in such
environments looks like. Previous research in this area has
mostly focused on grouping students with similar strategies 
or identifying learning strategies through sequence mining. 
In this paper, we investigate features and models for an early 
prediction of conceptual understanding based on clickstream 
data of students using an interactive Physics simulation. To 
this end, we measure students’ conceptual understanding 
through a task they need to solve through their exploration. 
Then, we propose a novel pipeline to transform clickstream 
data into predictive features, using latent feature represen- 
tations and interaction frequency vectors for different com- 
ponents of the environment. Our results on interaction data 
from 192 undergraduate students show that the proposed 
approach is able to detect struggling students early on. 


Keywords 
skip-grams, early classification, interactive simulations, con- 
ceptual understanding 


1. INTRODUCTION 


In recent years, interactive simulations have been increasingly
used for science education (e.g., the PhET simulations
alone are used over 45M times a year [1]). Interactive
simulations allow students to engage in inquiry-based learn- 
ing: they can design experiments, take measurements, and 
test their hypotheses. Ideally, students discover the prin- 
ciples and models of the underlying domain through their 
own exploration [2], but students often struggle to effec- 
tively learn in such environments [3, 4, 5]. A possible reason 
for this is that interactive simulations are usually complex 
and unstructured environments allowing students to choose
their own action path [6]. Providing adaptive guidance to
students therefore has the potential to improve learning
outcomes.


Implementing effective support in interactive learning en- 
vironments is a challenge in itself: the complexity of the 
environment makes it difficult to define a priori what
successful student behaviour looks like. Previous research has
focused on leveraging sequence mining and clustering tech- 
niques to identify the key features of successful interactions. 
For example, [7] have used an information theoretic sequence 
mining approach to detect differences in the interaction se- 
quences of students with high and low prior knowledge, 
while [8] investigated the effects of prior knowledge activa- 
tion. Other work [9] focused on detecting behaviours leading 
to the design of a correct causal explanation. [10] identified 
key factors for successful inquiry: focusing on an unknown 
component and building contrastive cases. Similarly, [11] 
found that the identification of the dependent variable and 
its isolated manipulations lead to a better quantitative un- 
derstanding of the phenomena at hand. Another technique 
is to manually categorise students’ log data and use the 
tags as ground truth for a classifier of successful inquiry 
behaviour [12]. [13] developed a dashboard displaying in- 
formation about the mined sequences to guide teachers in 
building their lessons. 


More work has focused on analysing and predicting students’ 
strategies in different types of open ended learning environ- 
ments (OELEs), such as educational games. Prior research 
in that domain has, for example, investigated students’ prob- 
lem solving behaviour [14], analysed the effect of scaffolding 
on students’ motivation [15], extracted strategic moves from 
video learning games [16], detected different types of confu- 
sion [17], or identified students’ exploration strategies [18]. 


Most of the previous work on OELEs has performed a pos- 
teriori analyses. However, in order to provide students with 
support in real-time, we need to be able to detect strug- 
gling students early on. Due to the lack of clearly defined 
student trajectories and underlying skills, building a model 
of students’ learning in OELEs is challenging. A promis- 
ing approach for early prediction in OELEs is the use of a 
clustering-classification framework [19]: in the (first) offline 
step, students are clustered based on their interaction data 
and the clustering solution is interpreted. The second step 
is online: students are assigned to clusters in real-time. This
framework has been successfully applied to analyse and predict
students’ trajectories in mathematics learning [20], to
differentiate between ‘high’ and ‘low’ learners [21], to build
student models for interactive simulations [22], or to predict
students’ exploration strategies in an educational game [23].

Figure 1: User interface of the PhET Capacitor Lab with different plate parameters for a closed (A) and an open circuit (B).
The two initial closed circuit configurations and the resulting four open circuit configurations presented to the participants in
the capacitor ranking task (C). (Simulation image by PhET Interactive Simulations, University of Colorado Boulder, licensed
under CC-BY 4.0, https://phet.colorado.edu).


In this paper, we aim to predict conceptual understanding early,
based on students’ log data from an interactive
Physics simulation. All our analyses are based on data col- 
lected from 192 undergraduate Physics students interacting 
with a PhET simulation. We propose a novel pipeline for 
transforming clickstream data into predictive features using 
latent feature representations and frequency vectors. Then, 
we extensively evaluate and compare various combinations 
of predictive algorithms and features on different classifica- 
tion tasks. In contrast to previous work using unsupervised 
clustering to obtain student profiles [21, 20, 23, 22], our 
learning activity with the simulation includes a task specifi- 
cally designed to assess students’ conceptual understanding. 
With our analyses, we address three research questions: 1)
Can students’ interaction data be associated with
the gained conceptual understanding? 2) Can conceptual
understanding be inferred through sequence mining methods
with embeddings? 3) Can the proposed methods be
used for early prediction of students’ conceptual understanding
based on partial sequences of interaction data?


Our results show that all tested models are able to predict 
students’ conceptual knowledge with a high AUC when ob- 
serving students’ full sequences (offline). The best models 
are also able to detect struggling students early on and to 
provide a more fine-grained prediction of students’ concep- 
tual knowledge later during interaction. 


2. CONTEXT AND DATA 


All experiments and evaluations of this paper were con- 
ducted using data from students exploring an interactive 
simulation. In the following, we describe the learning activ- 
ity, the data collection, and the categorisation of students’ 
conceptual understanding at the end of the learning activity. 


Learning Activity. The data for this work was collected in 
a user study where participants were asked to engage in 
an inquiry-based learning activity with the PhET Capacitor 
Lab simulation. The Capacitor Lab is an interactive simulation
with a simple and intuitive interface allowing users
to explore the principles behind a plate capacitor (Fig. 1A 
and B). Specifically, students can load the capacitor by ad- 
justing the battery and observe how the capacitance and 
the stored energy of the capacitor change when adjusting 
the voltage, the area of the capacitor plates or the distance 
between them. After loading the capacitor, the circuit can 
be opened through a switch, and students can again observe 
how manipulation of the different components influences ca- 
pacitance and stored energy. Moreover, the simulation pro- 
vides a voltmeter, while check boxes in the interface allow 
users to enable or disable visualisations of specific measures. 


Based on this simulation, a learning activity was designed 
in which participants had to explore the relationships be- 
tween the different components of the circuit and rank four 
different capacitor configurations by the amount of stored 
energy. The configurations were generated based on two ini- 
tial setups (I and II, respectively) representing capacitors in 
a closed circuit with different settings for battery voltage, 
plate area and separation (Fig. 1C). For each initial setup,
two open circuit configurations were generated (i.e. config- 
urations 1 & 2 from setup I and 3 & 4 from setup II) by 
opening the switch and then changing the values for plate 
area and separation. To complete this ranking task, partici- 
pants were allowed to use the simulation for as much time as 
they needed. It should be noted that the values in the rank- 
ing task were chosen outside the ranges of the adjustable 
values in the simulation such that students could not simply 
reproduce the four configurations in the simulation, but had 
to solve the task by figuring out the relationships between 
the different components and the stored energy. 


Data Collection. Data was collected from 214 first-year un- 
dergraduate Physics students who completed the capacitor 
ranking task as part of a homework assignment. While 
working with the simulation, students’ interaction traces 
(i.e. clicks on check boxes, dragging of components, moving 
of sliders) were automatically logged by the environment. 
Moreover, it recorded students’ final answers in the rank- 
ing task (i.e. the ranking of the four configurations). All 
data was collected in a completely anonymous way and the 
study was approved by the responsible institutional review 
board prior to the data collection (HREC number: 050- 
2020/05.08.2020). After a first screening of the data, several 
log files (9) were excluded because of inconsistencies in the 
data, and another 13 because they had barely any interaction
(fewer than 10 clicks) with the environment. Removing
these data points resulted in a data set of 192 students used 
for our analyses. 


Categorisation of conceptual understanding. The design of 
the ranking task allows us to relate students’ responses to their
conceptual understanding of a capacitor. For this purpose,
we analysed the 16 (out of 24 possible) rankings submitted
by the students with regard to conceptual understanding
and grouped them accordingly. To this end, three concepts
of understanding associated with the functioning of
capacitors were evaluated in a top-down approach: answers 
were first separated by those representing an understand- 
ing of both the open and closed circuit (label both), and 
those only representing an understanding of the closed cir- 
cuit (label closed). For those answers representing an under- 
standing of both the open and closed circuit, two cases were 
distinguished. It was assumed that students who chose the 
only correct ranking of the configurations (“4213”) gained an 
exhaustive understanding of the underlying concepts (label 
correct). Students who instead chose one of the other rank- 
ings were assumed to know how plate area and separation 
influence the stored energy in both the open and closed cir- 
cuits, but failed to discover the influence of voltage on stored 
energy (label areasep). Within those answers that only rep- 
resented an understanding of the capacitor’s functioning in 
the closed circuit, we also distinguished between two cases. 
The first case represents the answer that would be consid- 
ered correct if the task was to order the four configurations 
by capacitance instead of energy (“1324”, label capacitance). 
Interestingly, 47 students (i.e. 24% of all students) submit- 
ted this ranking as an answer. The second case represents all 
other possible answers (label other) that could be submitted 
if (a part of) the closed circuit was understood. 


Based on these three underlying concepts, we generated a
decision tree with four leaves (each representing a group
with similar conceptual understanding) and mapped all 16
rankings submitted by the students to the leaves (Fig. 2).
These generated class labels will serve as ground truth labels
for the classification task presented in the following sections.


Circuit understanding
  Open and closed (BOTH)
    Exhaustive understanding: yes (CORRECT)
      4213 (38)
    Exhaustive understanding: no (AREASEP)
      4231 (25), 2431 (2), 4321 (10), 2413 (20), 2143 (3)
  Closed only (CLOSED)
    Ranking performed by: capacitance (CAPACITANCE)
      1324 (47)
    Ranking performed by: other (OTHER)
      1243 (5), 4123 (1), 1342 (3), 4132 (2), 3412 (4), 1432 (1), 3124 (3), 134 (1), 1234 (27)


Figure 2: Tree used to map the 16 different rankings submitted
by the students to class labels associated with conceptual
understanding of a capacitor. The different class labels are
indicated in capitalised letters. The numbers in parentheses
indicate the number of submissions for each ranking.
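
The mapping of Fig. 2 can be written down as a simple lookup table. The following is a minimal sketch, not the code used in the study: the rankings and counts are copied from the figure (including the entry "134" exactly as it appears there), while the names RANKINGS_BY_LABEL, LABEL_OF and class_sizes are hypothetical.

```python
# Rankings and submission counts as listed in Fig. 2, grouped by leaf label.
# The entry "134" is reproduced exactly as it appears in the figure.
RANKINGS_BY_LABEL = {
    "correct": {"4213": 38},
    "areasep": {"4231": 25, "2431": 2, "4321": 10, "2413": 20, "2143": 3},
    "capacitance": {"1324": 47},
    "other": {"1243": 5, "4123": 1, "1342": 3, "4132": 2, "3412": 4,
              "1432": 1, "3124": 3, "134": 1, "1234": 27},
}

# Inverted lookup: submitted ranking -> class label.
LABEL_OF = {
    ranking: label
    for label, rankings in RANKINGS_BY_LABEL.items()
    for ranking in rankings
}


def class_sizes():
    """Number of students per leaf of the decision tree."""
    return {label: sum(r.values()) for label, r in RANKINGS_BY_LABEL.items()}
```

Summing the counts per leaf gives the class sizes used later in the evaluation, and the total matches the 192 students in the data set.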


3. METHOD 


Using our proposed approach, we are interested in predicting 
the conceptual understanding students gain from interact- 
ing with the simulation. Therefore, we are solving a super- 
vised classification problem, i.e. we aim at predicting the 
class labels (representing students’ conceptual understand- 
ing) based on the observed student interactions. Our model 
building process to solve this classification problem consists 
of four steps (Fig. 3). We first extract the raw clickstream 
events from the logs and process them into action sequences. 
We then compute three different types of features for each 
action sequence and feed them into our classifiers. 


Event Logs. From the simulation logs, we extract the click- 
stream data of each student s as follows: anything between 
a mouse click/press and a mouse release qualifies as an event, 
while anything between a mouse release and a mouse click/press
is called a break. Each event is then labelled by the component
the user was interacting with at the mouse click, and
chronologically arranged with the breaks into a sequence. 


Action Processing. We distinguish three main components 
on the platform, whose values can be changed: 1) the voltage, 
2) the separation between the plates, and 3) the area of the 
plates. An action on these components can be conducted in 
a) an opened circuit or b) a closed circuit, with the stored 
energy information display i) on or ii) off. We categorise 
each event involving these main actions by the combination 
of: the action on the component {1), 2), 3)}, the circuit 
state {a), b)}, and the stored energy display {i), ii)}. Any 
other event is categorised as 4) other. 
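
The categorisation above can be sketched as follows. This is an illustrative sketch under our own naming assumptions: the string labels ("voltage", "open", "on", ...) and the function names are hypothetical and do not reflect the identifiers used in the simulation logs.

```python
from itertools import product

COMPONENTS = ("voltage", "separation", "area")  # main components 1), 2), 3)
CIRCUIT_STATES = ("open", "closed")             # circuit state a), b)
ENERGY_DISPLAY = ("on", "off")                  # stored energy display i), ii)


def category_name(component, circuit, energy):
    """Combine the three attributes into a single category label."""
    return f"{component}/{circuit}/{energy}"


# 3 components x 2 circuit states x 2 display states, plus the catch-all "other".
EVENT_CATEGORIES = [
    category_name(*combo)
    for combo in product(COMPONENTS, CIRCUIT_STATES, ENERGY_DISPLAY)
] + ["other"]


def categorise_event(component, circuit, energy):
    """Map one logged event to one of the event categories."""
    if component in COMPONENTS:
        return category_name(component, circuit, energy)
    return "other"
```

Enumerating the combinations shows where the number of event categories comes from: twelve combinations of component, circuit state and display state, plus one category for all other events.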


The sequence of each student s is now composed of chronologically
ordered events (divided into the 13 different categories
listed above) separated by breaks. The breaks may
be caused by the student being inactive due to observing
the progression of a value, reflecting on an observation,
or taking notes. Due to its definition as the period between
a mouse release and a mouse click/press, a break may also
appear for logistical reasons, such as moving the mouse
from one component in the simulation to another.
Indeed, the students’ event sequences consist of many
short breaks (Fig. 4). Our assumption is that, like stop words
in natural texts, these very short breaks, though very frequent
in our sequences, do not contain much information.
In fact, our classification of students’ understanding may
be impaired if those noisy states are not removed, as
is the case for sentiment analysis when stop words are not
deleted [24]. To determine the threshold below which breaks
are removed, we plot the distribution of inactivity periods
and cut at the elbow of the curve for each student, which
corresponds to a delimitation at 60%, i.e. for each student
we keep the longest 40% of breaks. We then categorise each of
the remaining breaks similarly to our main action events: by
component 5) break, circuit state {a), b)}, and stored energy
display {i), ii)}, resulting in four different break categories.

Figure 3: Schematic overview of the classification pipeline. Raw clickstream events are extracted from log data and processed
into an action sequence per user. Three types of features are computed for each action sequence. Classification is performed

Figure 4: Distribution of break lengths in seconds, across all
students. The bold gray line denotes the 60% threshold.

The resulting sequence $r_s$ for student $s$ is the chronological
timeline of the student’s events and breaks, divided into 17
categories. We refer to this timeline $r_s$ as the raw sequence
of interactions for the rest of the paper. We denote the
length of $r_s$ (corresponding to the total number of interactions of
student $s$) by $N_s$, i.e. $|r_s| = N_s$. On average, these
sequences have a length of $N_s = 67.86 \pm 42.56$. In terms of
seconds, the sequences $r_s$ lasted on average $512.18 \pm 435.57$
seconds. We also introduce the notion of time, which we define
relative to a student’s interactions: at time $t$, the length
of the raw sequence of interactions for student $s$ is $t$, i.e.
$|r_{s,t}| = t$. We denote the maximum time of student $s$ by
$T_s$, corresponding to the full sequence $r_s$. Figure 5 visualises
the timelines for an exemplary student of each class label.
It can be observed that for these examples, certain aspects
of conceptual understanding could be inferred by visual inspection
(e.g. the capacitance student never activated the
check box to visualise the stored energy). However, other
differences in conceptual understanding are more difficult
for humans to detect (e.g. the differences between the correct
and areasep students).

Figure 5: Timeline of an exemplary student for each class label
displaying the chronological sequence of interactions with
the three main components of the simulation. The green and
orange bars indicate whether the student displayed the capacitance
and stored energy. The background indicates whether
the interactions were conducted in a closed (grey) or open
(white) circuit.


Feature Creation. Next, we transform the interactions in 
each sequence to obtain three different types of features: 
Action Counts, Action Span, and Pairwise Embeddings. 


To obtain the Action Counts features $F_{AC,s}$ for a student $s$,
we first transform each interaction within the raw sequence
$r_s$ into a one-hot encoded vector, resulting in a 17-dimensional
vector $h_{s,i}$ for each interaction $i$ and hence a sequence of
vectors $H_s = \{h_{s,i}\}$ with $i = 1, \ldots, N_s$. To compute $f_{AC,s,t}$
for student $s$ at time step $t$, we compute the average over
$h_{s,i}$ with $i = 1, \ldots, t$: $f_{AC,s,t} = \frac{1}{t} \sum_{i=1}^{t} h_{s,i}$. By using this
aggregation technique, our features represent the (averaged)
number of times each student has interacted in each
of our categories. Therefore, for each student $s$, we end up
with a feature set $F_{AC,s} = \{f_{AC,s,t}\}$ with $t = 1, \ldots, T_s$.
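
The running average over one-hot vectors can be sketched as follows. This is a minimal illustration in plain Python, assuming each interaction is already encoded as an integer category index between 0 and 16; the function name action_counts is hypothetical.

```python
def action_counts(category_ids, dim=17):
    """Return the Action Counts feature vector for every time step t.

    `category_ids` holds one integer category index per interaction. The
    t-th output is the mean of the first t one-hot vectors, i.e. the
    relative frequency of each category up to time t.
    """
    running = [0] * dim  # sum of one-hot vectors seen so far
    features = []
    for t, category in enumerate(category_ids, start=1):
        running[category] += 1  # adding a one-hot vector
        features.append([count / t for count in running])
    return features
```

For a toy sequence of categories 0, 1, 0 with three dimensions, the feature vector at time step 2 is [0.5, 0.5, 0.0].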


The computation of the Action Span features $F_{AS,s}$ for a student
$s$ is very similar. Rather than looking at the number
of times a student $s$ has interacted with each of our components
in a particular state, we look at the amount of time
(in seconds) $s$ has spent in each of the categories. We first
transform every interaction $i$ within $r_s$ into a 17-dimensional
vector $h_{s,i}$. This vector is 0 in all dimensions but the
dimension $d$ corresponding to the category the $i$-th interaction
belongs to. This representation is similar to a one-hot encoded
vector, but instead of filling in a 1 at dimension $d$,
we fill in the duration (in seconds) of interaction $i$. This results
in a sequence of vectors $H_s = \{h_{s,i}\}$ with $i = 1, \ldots, N_s$.
We then compute the feature vector $f_{AS,s,t}$ for student $s$ at
time $t$ in two steps. In a first step, we again average over all
vectors $h_{s,i}$ up to time $t$, leading to $\tilde{f}_{AS,s,t} = \frac{1}{t} \sum_{i=1}^{t} h_{s,i}$.
We then normalise $\tilde{f}_{AS,s,t}$ to obtain $f_{AS,s,t}$. By using this
aggregation technique, our feature vector $f_{AS,s,t}$ represents
the relative amount of time student $s$ has spent in each category
up to time $t$. For each student $s$, we end up with
a feature set $F_{AS,s} = \{f_{AS,s,t}\}$ with $t = 1, \ldots, T_s$.
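
A possible implementation of the two steps is sketched below. The text only states that the averaged vector is normalised; the sketch assumes a normalisation by the sum of its entries, so that each entry is the share of time spent in a category. The function name and the input layout are likewise assumptions.

```python
def action_span(category_ids, durations, dim=17):
    """Return the Action Span feature vector for every time step t.

    `category_ids[i]` is the category index of interaction i and
    `durations[i]` its duration in seconds.
    """
    running = [0.0] * dim  # summed duration per category so far
    features = []
    for t, (category, duration) in enumerate(zip(category_ids, durations), start=1):
        running[category] += duration
        averaged = [total / t for total in running]  # step 1: average
        norm = sum(averaged)
        # Step 2 (assumed): divide by the sum so that entries add up to 1.
        features.append([v / norm if norm > 0 else 0.0 for v in averaged])
    return features
```

Under this assumption the entries of each feature vector sum to one, which matches the description of the features as relative amounts of time.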


The third feature, Pairwise Embeddings, is fundamentally
different from the two other features: we replace each interaction
in the raw sequence by an embedding vector, which
we obtain by training a pairwise skip-gram [25]. The architecture
of such a network consists of two dense layers: an
embedding layer followed by a classification layer. Usually
applied in natural language processing (NLP), its primary
goal is to predict the context of a word. Here, our pairwise
skip-gram attempts to predict the behaviour of a student in
the simulation before and after performing a specific interaction.
The skip-gram model can be formulated as:


$$p = \mathrm{softmax}(W_2 \cdot (W_1 \cdot a)) \tag{1}$$


It takes as input $a$, an interaction whose context we wish to predict,
and outputs $p$, a probability vector which contains
the likelihood of $a$ being surrounded by each possible
interaction. $W_1$ and $W_2$ represent the weight matrices (embeddings).
For each interaction $a$, we feed $2 \cdot w$ pairs into
the network, where $w$ is the so-called window size of the
model (context). The first element of each pair is $a$. The
second element of each pair, the ground truth label, is one
of the $w$ interactions preceding or following $a$. For example,
a window size of $w = 2$ would yield the following pairs for
action $a$: $(a, a_{-2}), (a, a_{-1}), (a, a_{+1}), (a, a_{+2})$.


In our case, to obtain the set of pairs $I_s$ for a student $s$,
we first again transform each interaction within the raw sequence
$r_s$ into a one-hot encoded vector, resulting in a 17-dimensional
vector $h_{s,i}$ for each interaction $i$ and, hence,
a sequence of vectors $H_s = \{h_{s,i}\}$ with $i = 1, \ldots, N_s$. We
then build the input pairs for each interaction $i$, i.e. $I_{i,s} =
\{(h_{s,i}, h_{s,i+j})\}$ with $j \in \{-w, \ldots, w\} \setminus \{0\}$. We obtain the set of
pairs for all students as $I = \{I_s\}$ with $s = 1, \ldots, S$, where $S$
is the total number of students.
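
The construction of the training pairs for one student can be sketched as follows. The sketch works on category labels rather than one-hot vectors for readability, and it assumes that neighbours falling outside the sequence are simply skipped, so interactions near the start or end contribute fewer than $2 \cdot w$ pairs; the text does not specify this boundary handling.

```python
def skipgram_pairs(sequence, w=2):
    """Build (interaction, context interaction) pairs for one student.

    For each position i, every neighbour at offset j in {-w, ..., w} \\ {0}
    that lies inside the sequence yields one pair.
    """
    pairs = []
    n = len(sequence)
    for i, action in enumerate(sequence):
        for j in range(-w, w + 1):
            if j != 0 and 0 <= i + j < n:
                pairs.append((action, sequence[i + j]))
    return pairs
```

For an interaction in the middle of a sequence and a window size of 2, the function yields exactly the four pairs described in the example above.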


[Figure 6 diagram: one-hot input vector, hidden linear layer, softmax layer.]

Figure 6: Skip-gram architecture. After training of the skip-gram,
the corresponding row of the weight matrix $W_1$ represents
a structure-preserving embedding of action $a$.


After training the skip-gram model on $I$, we build the features
$F_{PW,s}$ for each student $s$. The weights $W_1$ of the hidden
layer represent a structure-preserving embedding of the
interactions. In our case, $W_1$ has dimensions $17 \times D$ ($D$ is the
embedding dimension). For each interaction $i$ of student $s$,
we get the corresponding row $r$ in $W_1$, i.e. $r_{s,i} = W_1 \cdot h_{s,i}$.
This again results in a sequence of vectors $R_s = \{r_{s,i}\}$ with
$i = 1, \ldots, N_s$. To compute $f_{PW,s,t}$ for student $s$ at time
step $t$, we compute the average over $r_{s,i}$ with $i = 1, \ldots, t$:
$f_{PW,s,t} = \frac{1}{t} \sum_{i=1}^{t} r_{s,i}$. Therefore, for each student $s$, we end
up with a feature set $F_{PW,s} = \{f_{PW,s,t}\}$ with $t = 1, \ldots, T_s$.
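
Given a trained embedding matrix, the feature computation reduces to a row lookup followed by a running average. The sketch below assumes $W_1$ is available as a list of 17 rows of length $D$ and that interactions are encoded as integer category indices; selecting a row by index is equivalent to multiplying by the one-hot vector. The function name is hypothetical.

```python
def pairwise_embedding_features(category_ids, W1):
    """Return the Pairwise Embeddings feature vector for every time step t.

    `W1` is a list of rows, one embedding of length D per category.
    """
    dim = len(W1[0])
    running = [0.0] * dim  # sum of embedding rows seen so far
    features = []
    for t, category in enumerate(category_ids, start=1):
        row = W1[category]  # row lookup replaces the one-hot product
        running = [acc + value for acc, value in zip(running, row)]
        features.append([acc / t for acc in running])
    return features
```

With a toy matrix of three categories and $D = 2$, a student who performs categories 0 and then 2 obtains the mean of the two corresponding rows at time step 2.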


Classification. To perform the classification task, we explore 
two different approaches: Random Forests and Fully Con- 
nected Deep Neural Networks. 


Random Forests (RFs) are simple, yet powerful machine 
learning algorithms. They consist of an ensemble of decision 
trees, each trained on a different subset of samples and a dif- 
ferent subset of features. The decisions of each tree are then 
aggregated to determine the final prediction of a sample. 
The strength of this method is that overfitting is prevented 
through the randomisation of training samples and features 
during the training of each tree and that the strengths of 
several good classifiers are exploited. While RF classifiers 
are well tested and efficient to train, they require the input 
features to have the same dimension for every sample. We 
therefore train separate RF models for each time step $t$. The
input features for the RF model for time step $t$ are $\{f_{M,s,t}\}$,
with $M \in \{AC, AS, PW\}$ and $s = 1, \ldots, S$, where $S$ denotes
the number of students. The output of the RF is a vector
$p_{RF,M,s,t}$ of dimension $C$ (with $C$ denoting the number of
classes) for each student $s$, which represents the probability
of each class.


Neural Networks (NNs) were built with the idea of emu- 
lating neurons firing in our brain: their nodes are to the 
neurons what their edges are to their axons. The advantage 
of those deep networks is that they are able to model non- 
linear decision boundaries. However, the back propagation 
calculations make them relatively slow to train. In this work, 
we use a Fully Connected Deep Neural Network consisting 
of d hidden dense layers and one classification layer with a 
softmax activation. Similar to RFs, our NN model requires 
features to have the same dimension for all the samples. We 
therefore also train the NN models for fixed points in time.
The input features for the NN model for time step $t$ are
$\{f_{M,s,t}\}$, with $M \in \{AC, AS, PW\}$ and $s = 1, \ldots, S$, where $S$
denotes the number of students. Due to the softmax activation,
the output of the NN is a vector $p_{NN,M,s,t}$ of dimension
$C$ (with $C$ denoting the number of classes) for each student
$s$, which represents the probability of each class.


4. EXPERIMENTAL EVALUATION 


We evaluated the predictive performance of our classifica- 
tion pipeline on the collected data set (see Section 2). We 
conducted experiments to compare the performance of dif- 
ferent model and feature combinations for the students’ full 
data sequences as well as for early classification using partial 
sequences of the data, to answer our research questions. 


Experimental Setup. We applied a train-test setting for all 
the experiments, i.e. parameters were fitted on a training 
data set and performance of the methods was evaluated on 
a test data set. Predictive performance was evaluated using 
the macro-averaged area under the ROC curve (AUC). We 
used the AUC as a performance measure as it is robust to 
class imbalance. 
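
For reference, a macro-averaged AUC can be computed as sketched below. The sketch assumes a one-vs-rest scheme in which the AUC of each class is obtained from pairwise comparisons of predicted probabilities (ties count one half) and the per-class values are averaged with equal weight; the text does not state which multi-class scheme was used, and the function names are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a positive outranks a negative sample."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in pos
        for sn in neg
    )
    return wins / (len(pos) * len(neg))


def macro_auc(probabilities, labels, classes):
    """Unweighted mean of the one-vs-rest AUC over all classes.

    `probabilities[n][k]` is the predicted probability of class
    `classes[k]` for sample n.
    """
    per_class = [
        binary_auc([p[k] for p in probabilities], [y == c for y in labels])
        for k, c in enumerate(classes)
    ]
    return sum(per_class) / len(per_class)
```

Because every class contributes equally to the average, small classes are not dominated by large ones.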


We performed all our experiments using different levels of 
detail for the classification task. As ground truth, we used 
the class labels presented in Fig. 2. Given the hierarchical 
nature of the decision tree separating the students in classes 
based on their conceptual knowledge, we performed the clas- 
sification task focusing on three different levels of detail: 


• 2-class case: starting at the root of the decision tree
(see Fig. 2), we divide students into two classes based
on their understanding of the circuits: both (98 students)
and closed (94 students).


• 3-class case: going one step down in the hierarchy of
the tree, we further divide the left branch of the tree
(see Fig. 2) based on whether the students have completely
understood all concepts (leading to a correct
answer in our ranking task) or not. We therefore obtain
three different classes: correct (38 students), areasep
(60 students), and closed (94 students).


• 4-class case: here, we also split the right branch of
the tree (see Fig. 2) and divide the students into two
groups based on whether they ranked the configurations
in the task based on capacitance, resulting in four
classes: correct (38 students), areasep (60 students),
capacitance (47 students), and other (47 students).
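
The three label sets are nested, so the coarser labels can be derived from the 4-class labels. A minimal sketch, with hypothetical function names and the class sizes taken from the list above:

```python
# Parent node of each 4-class leaf in the decision tree of Fig. 2.
PARENT = {
    "correct": "both",
    "areasep": "both",
    "capacitance": "closed",
    "other": "closed",
}

# Class sizes of the 4-class case as reported in the text.
SIZES_4 = {"correct": 38, "areasep": 60, "capacitance": 47, "other": 47}


def to_3class(label):
    """Keep the split of the left branch, merge the right branch."""
    return label if PARENT[label] == "both" else "closed"


def to_2class(label):
    """Collapse each leaf to its branch at the root of the tree."""
    return PARENT[label]


def collapse_sizes(mapping):
    """Aggregate the 4-class sizes under a coarser labelling."""
    sizes = {}
    for label, count in SIZES_4.items():
        sizes[mapping(label)] = sizes.get(mapping(label), 0) + count
    return sizes
```

Collapsing the 4-class sizes in this way reproduces the reported 98 and 94 students for the 2-class case.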


For each of those three cases, we trained two types of classi- 
fiers (RF and NN) on our three different feature types (Ac- 
tion Counts - AC, Action Span - AS, Pairwise Embeddings 
- PW), using a stratified 10-fold nested cross validation. We 
kept the folds invariant across all experiments and strati- 
fied over the classes (according to the class labels of the 
4-class case). Because of class imbalance, we used random 
oversampling for the training sets. We used a nested cross 
validation to avoid potential bias introduced by estimating 
model performance during hyperparameter tuning. This al- 
lowed us to tune the hyperparameters within the training 
folds (by further splitting them) and hold out the test sets 
for performance evaluation alone. 
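
The outer loop of this procedure can be sketched as follows. This is a simplified illustration rather than the implementation used in the experiments: it only shows a stratified fold assignment and random oversampling of a training set, the inner hyperparameter loop is omitted, and the function names and seeding are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Distribute sample indices over folds, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def random_oversample(indices, labels, seed=0):
    """Resample minority classes with replacement up to the majority size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index in indices:
        by_class[labels[index]].append(index)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced
```

Oversampling is applied to the training indices only, so the held-out fold keeps its original class distribution.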

Figure 7: AUC for 4-class, 3-class and 2-class cases using different
model and feature combinations. Predictions are made
at the end of the interaction with the simulation, i.e. based
on the complete sequential interaction data of the students.


For RF models, we tuned the following hyperparameters 
using a grid search: number of trees [5, 7, 9], number of 
features used at each decision level [‘auto’, ‘all’], number 
of samples [bootstrap resampling of training size samples 
and balanced subsamples]. NN models were implemented 
using the scikit-learn library, trained for 300 epochs, and 
optimised for the log-loss function with the following hy- 
perparameters: learning rate [‘adaptive’, ‘invscaling’], initial 
learning rate [0.01, 0.001], solver [‘adam’, ‘sgd’], hidden layer 
sizes and number [(32, 16), (64, 32), (64, 32, 16), (128, 64, 
32, 16)], and activation function [‘relu’, ‘tanh’, ‘identity’]. 
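
To make the size of the search space explicit, the two grids can be written as plain dictionaries and expanded. The values are taken from the text; the key names are hypothetical labels that loosely follow scikit-learn naming and are not guaranteed to match the argument names used in the experiments.

```python
from itertools import product

# Random Forest grid (values from the text, key names assumed).
RF_GRID = {
    "n_trees": [5, 7, 9],
    "max_features": ["auto", "all"],
    "sampling": ["bootstrap", "balanced_subsample"],
}

# Neural network grid (values from the text, key names assumed).
NN_GRID = {
    "learning_rate": ["adaptive", "invscaling"],
    "learning_rate_init": [0.01, 0.001],
    "solver": ["adam", "sgd"],
    "hidden_layer_sizes": [(32, 16), (64, 32), (64, 32, 16), (128, 64, 32, 16)],
    "activation": ["relu", "tanh", "identity"],
}


def expand(grid):
    """List every hyperparameter combination of a grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
```

Expanding the grids gives 12 candidate configurations for the RF and 96 for the NN, each of which is evaluated inside the training folds.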


The skip-gram model providing our pairwise embedding fea- 
tures was implemented using the TensorFlow package. We 
trained the model for 150 epochs, with a window size of 
w = 2, a batch size of 16, and an embedding dimension of 
15. We used categorical cross-entropy as the training loss. 
Because of its unsupervised nature, we trained the model on 
our whole dataset. 


Offline Classification. In a first experiment, we were inter- 
ested in assessing whether it is possible to associate students’ 
behaviour in the simulation with their conceptual under- 
standing achieved through the learning activity. This will 
be referred to as an offline classification task, since we are 
using students’ complete interaction sequences rs. The pre- 
dictive performance in terms of AUC for the three classifi- 
cation problems (4-class case, 3-class case, and 2-class case) 
with a distinction between different model and feature com- 
binations is illustrated in Fig. 7. 


The results of this first experiment showed that for the 2-class
case, all combinations of models and features reached
very high average performances as quantified by their AUC
scores (value range: 0.95 to 0.97). The best mean score
was achieved by the combination of NN with PW features
($\mathrm{AUC}_{NN,PW} = 0.97$). However, it should be noted that
the performance differences between the combinations were
comparatively small. Using a one-way ANOVA, no statistically
significant differences were found between the different
groups ($F(5, 54) = 0.839$, $p = 0.528$). It seems that for
this rather coarse classification into groups of students who
only understood the closed circuit and those who understood
both the open and closed circuits, the different combinations
of models and features perform equally well and with a high
predictive accuracy.

Figure 8: AUC for 4 classes, 3 classes, and 2 classes using different model and feature combinations. Predictions are made over
time, stopping at time step t = 150 as only very few students have longer interaction sequences.
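
The degrees of freedom reported for the ANOVA follow from comparing six model and feature combinations with ten per-fold AUC scores each. The sketch below computes the F statistic of a one-way ANOVA in plain Python; it is an illustration with a hypothetical function name, and the p-value, which requires the F distribution, is not computed.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

For six groups of ten scores, the function returns the degrees of freedom (5, 54) used in the reported tests.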


By extending the classification task to the 3-class case, the
AUC scores dropped in comparison to the 2-class case (value
range: 0.86 to 0.90). The lowest score was observed for RF
with AC ($\mathrm{AUC}_{RF,AC} = 0.86$), while the best performances were
observed for RF with AS ($\mathrm{AUC}_{RF,AS} = 0.90$), NN with AS
($\mathrm{AUC}_{NN,AS} = 0.89$), and NN with PW ($\mathrm{AUC}_{NN,PW} =
0.89$). Similar to the 2-class case, no statistically significant
differences were found between the different groups using a
one-way ANOVA ($F(5, 54) = 0.740$, $p = 0.597$). This result
illustrates that further dividing those who understood the
functioning of the capacitor in both the open and closed
circuit had a similar impact on the predictive performance
of all model and feature combinations.


Finally, when evaluating the 4-class case, the most complex
classification task, the mean AUC scores dropped further for all
combinations (value range: 0.78 to 0.84). The lowest performance was
again observed for RF with AC (AUC_{RF,AC} = 0.78). Introducing the
fourth class to the classification problem seemed to have a smaller
negative impact on the AUC scores for NN with AS (AUC_{NN,AS} = 0.83)
and NN with PW (AUC_{NN,PW} = 0.84), which obtain the best performances
for the 4-class case. This observation was partially confirmed by a
one-way ANOVA that showed a trend towards statistical significance for
differences between the combinations (F(5, 54) = 1.989, p = 0.095): as
we increase the number of classes, the p-value decreases.
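

The degrees of freedom reported for the three ANOVAs, F(5, 54), are
consistent with six model and feature combinations and 60 AUC scores in
total (6 - 1 = 5 and 60 - 6 = 54). The following minimal sketch shows how
such an F statistic is obtained. The ten scores per combination and all
score values are synthetic placeholders chosen for illustration, not the
values measured in this work, and the function name is hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Synthetic placeholder AUC scores: 6 combinations x 10 scores each.
noise = [-0.03, -0.02, -0.01, -0.01, 0.0, 0.0, 0.01, 0.01, 0.02, 0.03]
placeholder_means = [0.80, 0.81, 0.82, 0.83, 0.84, 0.85]
groups = [[m + e for e in noise] for m in placeholder_means]

f_stat, df_between, df_within = one_way_anova(groups)
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

The p-value would then be read from the F distribution with these
degrees of freedom, for instance with a statistics library.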


The results of this first experiment show that it is possi- 
ble to perform an offline prediction of students’ conceptual 
understanding in the capacitor ranking task (see Section 2) 
based on the different combinations of models (NN or RF) 
and feature generation methods (AC, AS or PW) proposed 
in this work. From the entire data sequences of students’ 
interactions with the simulation, we observed that predic- 
tive performance generally decreased when the complexity 
of the classification task was increased. While all combinations
showed very good performance for the coarse classification of the
2-class case, AUC scores started to diverge more among combinations for
the 3- and 4-class cases. Differences became most visible for the
4-class case, where certain combinations showed a trend towards better
predictive performance (i.e. NN with PW and NN with AS) than others
(e.g. RF with AC).


Predicting over Time. In a second experiment, we were interested in
assessing whether we could predict students’ conceptual knowledge from
shorter interaction sequences, i.e. when using only the first t
interactions rather than students’ full sequences. For all three
classification cases (2-class, 3-class, and 4-class), we trained all
model (RF, NN) and feature (AC, AS, PW) combinations for
t = 10, 20, ..., 150 time steps. As described in Section 3, to compute
the features f_{AC,s,t}, f_{AS,s,t}, and f_{PW,s,t} for a student s at
time step t, we only use the student’s interactions up to that point in
time, i.e. interactions 1, ..., t. Similarly, at each time step t, the
models are exclusively used to predict students whose interaction
sequences contain a minimum of t elements (i.e. N_s ≥ t). For students
with shorter interaction sequences (i.e. N_s < t), the last available
prediction is used. For example, for a student with 30 interactions
(N_s = 30), we would make the first three predictions using the models
for t = 10, t = 20, and t = 30 time steps. For the remaining time
steps, the prediction from t = 30 is carried over. We chose this
approach because we assume that the student will leave the simulation
after N_s interactions and it therefore does not make sense to update
the prediction afterwards. Figure 8 illustrates the predictive
performance in terms of AUC for the different model and feature
combinations and all classification cases.
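

The carry-over rule described above can be sketched as follows. This is
a minimal illustration, not the original implementation: the function
name, the dictionary layout, and the placeholder prediction labels are
assumptions.

```python
def predictions_over_time(preds_by_step, n_interactions, steps):
    """Build one prediction per time step for a single student.

    preds_by_step: maps a time step t to the prediction of the model
        trained for t (only needed for t <= n_interactions).
    n_interactions: length N_s of the student's interaction sequence.
    steps: evaluated time steps, e.g. 10, 20, ..., 150.
    """
    timeline = {}
    last = None  # stays None if the student has fewer interactions than steps[0]
    for t in steps:
        if t <= n_interactions:
            # The student has at least t interactions: use the model for t.
            last = preds_by_step[t]
        # Otherwise the last available prediction is carried over.
        timeline[t] = last
    return timeline


# Example student with N_s = 30 (placeholder predictions).
steps = list(range(10, 151, 10))
preds = {10: "closed", 20: "closed", 30: "both"}
timeline = predictions_over_time(preds, n_interactions=30, steps=steps)
print(timeline[30], timeline[40], timeline[150])
```

For this student, every time step after t = 30 reuses the prediction
made at t = 30.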


As expected, for the 2-class case, all models achieve a high
performance for long interaction sequences. The AUC of all NN models is
larger than 0.9 from time step t = 70 onwards. Generally, the
difference between the model and feature combinations for t > 70 is
small. We observe some model differences for earlier time steps, where
the RF model with the action span features performs better than the
other models. It achieves an AUC larger than 0.8 already at time step
t = 30 (AUC_{RF,AS} = 0.82), and the NN model with action span is close
to that performance at the same time step (AUC_{NN,AS} = 0.79).
Naturally, predicting at even earlier time steps is more difficult, but
some of the models achieve a decent AUC above 0.7 already after
observing 20 interactions with the simulation (AUC_{RF,AS} = 0.74,
AUC_{RF,AC} = 0.73, AUC_{NN,AS} = 0.71).
Figure 9: Evolution of confusion matrices over time steps for the 3-class case: NN with PW (top) and RF with AS (bottom).


For the 3-class case, the performance of all algorithms is lower than
in the 2-class case from time step t = 10 onwards. This is not
surprising, as differentiating students of class correct from students
of class areasep is not as straightforward as separating students who
did not interact in the open circuit from the rest. While all
performances were close to one another in the 2-class case, for the
3-class problem three model-feature combinations take the lead at time
step t = 40 (AUC_{NN,PW} = 0.75, AUC_{RF,AS} = 0.74, AUC_{NN,AS} =
0.74), while three fall behind (AUC_{RF,PW} = 0.7, AUC_{RF,AC} = 0.69,
and AUC_{NN,AC} = 0.68). This holds until t = 70, where the NN with PW
outperforms all other model and feature combinations with an AUC of
0.87.


The variance across model-feature performances is larger in the 4-class
case. Already from time step t = 20, RF with AS outperforms all other
combinations, reaching an AUC of 0.8 at time step t = 100. As in the
3-class problem, the same three combinations take the lead, while the
others fall behind from time step t = 70 (AUC_{NN,PW} = 0.75,
AUC_{RF,AS} = 0.78, AUC_{NN,AS} = 0.76 versus AUC_{RF,PW} = 0.72,
AUC_{RF,AC} = 0.72, and AUC_{NN,AC} = 0.73).


Given that the predictive performance in terms of AUC seems to be
similar for the best performing models, we performed a more detailed
analysis to assess how different the predictions of the model and
feature combinations were. In a real-world application (intervention
setting), we would probably use a 2-class classifier to identify a
coarse split (between classes both and closed) already at a relatively
low number of time steps and use a 3-class method to provide a more
detailed prediction later on. With the current model performance for
the 4-class case, using a more detailed classifier does not seem
practicable. We therefore investigate the predictions of the two best
models for
the 3-class case: NN with PW and RF with AS. Figure 9 
shows the confusion matrices for these two models for an 
increasing number of time steps. We do not show results 
for t > 100, as the predictive performance of the models 
does not improve much anymore for longer sequences (see 
also Fig. 8). While both models have a very similar predic- 
tive performance in terms of AUC up to time step 60, we 
can already see that the models evolve differently in terms 
of prediction. At time step t = 40, the RF model is already very
accurate in detecting students from class correct
(p(correct | c_true = correct) = 0.78), while the NN model is less
confident (p(correct | c_true = correct) = 0.67). On the other hand,
the NN model is already more accurate in identifying students from
class closed (p(closed | c_true = closed) = 0.66), while the RF model
cannot identify students from this class well
(p(closed | c_true = closed) = 0.42). At time step t = 60, both models
are almost equally accurate in identifying students from class correct
(NN: p(correct | c_true = correct) = 0.76, RF:
p(correct | c_true = correct) = 0.80). The NN model is still better at
classifying students from the class closed. Both models have trouble
with correctly identifying students from class areasep. While the NN
model tends to assign these students to class closed
(p(areasep | c_true = closed) = 0.52), the RF model is becoming better
at correctly assigning them (p(closed | c_true = closed) = 0.42). These
observed trends continue to get stronger with an increasing number of
time steps. At t = 100, the NN model
is very accurate when it comes to classifying students from 
classes correct and closed. Students from class areasep have 
only a 35% chance of being correctly classified and a 55% 
chance of getting assigned to class closed. In practice, this 
would mean that 55% of the students would get more in- 
tervention (hints) than necessary. The RF classifier is also 
very accurate in detecting students from class correct, but 
is, however, not able to distinguish between students from 
class closed and class areasep. In practice, this would mean 
misclassified students from class closed would get less help 
than necessary and misclassified students from class areasep 
would get more help than necessary. 
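

The conditional probabilities quoted above are read here as entries of
a row-normalised confusion matrix, i.e. the share of students of a true
class that receive a given predicted class. The following minimal
sketch computes such a matrix; the labels are toy data invented for
illustration, and the function name is hypothetical.

```python
from collections import Counter

CLASSES = ["correct", "areasep", "closed"]


def conditional_confusion(true_labels, predicted_labels, classes=CLASSES):
    """Return matrix[c_true][c_pred] = p(c_pred | c_true)."""
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = {}
    for c_true in classes:
        row_total = sum(counts[(c_true, c_pred)] for c_pred in classes)
        matrix[c_true] = {
            # Guard against classes without any student in the toy data.
            c_pred: counts[(c_true, c_pred)] / row_total if row_total else 0.0
            for c_pred in classes
        }
    return matrix


# Toy labels (not data from this study).
y_true = ["correct"] * 4 + ["areasep"] * 4 + ["closed"] * 2
y_pred = ["correct", "correct", "correct", "areasep",
          "areasep", "closed", "closed", "correct",
          "closed", "areasep"]

m = conditional_confusion(y_true, y_pred)
print(m["correct"]["correct"], m["areasep"]["closed"])
```

Each row of the matrix sums to one, so the off-diagonal entries of a row
show where the students of that class are sent instead.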


This experiment shows that we can (coarsely) classify stu- 
dents after observing a relatively low number of interactions. 
For the 2-class case, the AUC of the best model (RF with 
AS) is larger than 0.8 after t = 30 time steps. Naturally, 
the classification task is more complex for the 3-class and 
the 4-class cases. The best model on the 3-class case (NN 
with PW) achieves an AUC close to 0.8 at time step 50. 
The second analysis demonstrates that achieving a similar predictive
performance in terms of AUC does not imply the same classification
behaviour, i.e. the best models on the 3-class case (NN with PW and RF
with AS) have different strengths. This experiment has, however, one
important limitation: students spend different amounts of time in the
simulation and therefore, the length of their interaction sequences
varies. There are, for example, students with 80 interactions and other
students with only 50 interactions. Performing the classification task
at time step 40 is early for a student with a total of 80 interactions.
For a student with only 50 interactions in total, the prediction at
time step 40 comes almost at the end of the interaction time with the
simulation.


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021)


[Figure 10 plot: three panels (2-class case, 3-class case, 4-class case); x-axis: % interactions; legend: NN with PW, NN with AC, NN with AS, RF with PW, RF with AC, RF with AS]


Figure 10: AUC on the 2-class, 3-class and 4-class cases for different model and feature combinations. Predictions were made
for 25%, 50%, 75% and 100% of the total number of interactions for each student.


Online (Early) Classification. The third and last experiment addresses
the limitation of the previous experiment. In this experiment, we were
interested in assessing the ‘early’ predictive performance of the
different models using only a part of a student’s sequence. Given that
the length of interaction sequences varies across students, we did not
align the sequences by absolute time steps, but by percentages of
interactions. Specifically, we aimed at making the prediction for a
student after having seen 25%, 50%, 75%, and 100% of the interaction
sequence of this student. Note that this experiment does not require
re-training our models: we just retrieve the predictions of the models
for the corresponding time step t. For our example student with a total
of 80 interactions, we retrieve the predictions of the models for time
steps t = 20, t = 40, t = 60, and t = 80. Figure 10 shows the AUC of
the models for all classification cases, with an increasing number of
interactions (in %).
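

The mapping from a percentage of a student’s interactions to one of the
trained time steps can be sketched as follows. The text only gives the
example of a student with 80 interactions; how sequence lengths that do
not fall on the grid t = 10, 20, ..., 150 are handled is not specified,
so rounding down to the nearest trained time step (and never going
below the first one) is an assumption of this sketch, as is the
function name.

```python
def step_for_percentage(n_interactions, pct, step_size=10, max_step=150):
    """Map `pct` percent of a student's interactions to a trained time step.

    Assumption: models exist for t = step_size, 2 * step_size, ..., max_step,
    and the observed length is rounded down to the nearest such t.
    """
    observed = n_interactions * pct // 100       # interactions seen so far
    t = (observed // step_size) * step_size      # round down to the grid
    return max(step_size, min(t, max_step))      # clamp to available models


# Example student from the text: 80 interactions in total.
print([step_for_percentage(80, pct) for pct in (25, 50, 75, 100)])
```

Under this assumption, very short sequences are clamped to the first
model (t = 10), which may use more than the requested share of their
interactions.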


As expected, all model and feature combinations perform well for the
2-class case. To achieve a high classification accuracy, we do not need
to observe the full sequence of a student. For all NN models, observing
75% of students’ interactions is enough to achieve an AUC of around
0.9. With RF, the model using the action span features also obtains an
AUC of more than 0.9 at 75% of the interactions (AUC_{RF,AS} = 0.92).
The performance of the two other feature types is slightly lower
(AUC_{RF,AC} = 0.9, AUC_{RF,PW} = 0.88). Naturally, the predictive
performance of all models is lower when observing smaller parts of
students’ interaction sequences. When observing only the first 25% of
students’ interactions, there is more variation in the achieved AUC
between models, with the best model (RF with AS) achieving an AUC of
0.66 and the worst model (RF with PW) achieving an AUC of 0.57. It is
promising that the best model at 50% of the interactions exhibits an
AUC of almost 0.8 (AUC_{RF,AS} = 0.78), which makes it a valuable
candidate for a coarse early prediction and intervention, i.e.
differentiating between students with a high conceptual understanding
(class both) and students with a low conceptual understanding (class
closed) early on.


Naturally, the performance of the models for the 3-class case is lower
overall, as we are now differentiating the levels of conceptual
knowledge in a more fine-grained way. As we have seen in Fig. 9, it is
difficult to differentiate between students from the left branch of the
tree (i.e. correct vs. areasep in Fig. 2). Performance across models
varies more for the 3-class case. When observing 75% of students’
interactions, the AUC of the worst model (NN with AS) amounts to 0.78,
while the best model (NN with PW) has an AUC of 0.84. We also observe
that the NN with PW features is consistently the best model, regardless
of the number of observed interactions, with an increasing gap to the
other models. At 50% of interactions, the gap in performance among the
three best models is still small (AUC_{NN,PW} = 0.699, AUC_{RF,AS} =
0.7, AUC_{RF,PW} = 0.69). It gets larger at 75% (AUC_{NN,PW} = 0.84,
AUC_{RF,AS} = 0.82, AUC_{NN,AS} = 0.78). When observing the complete
sequences of the students (i.e. 100% of the interactions), all models
reach an AUC of 0.85 or higher (see also Fig. 7).


Again, the performance decreases when moving to four classes, due to
the increasing complexity (in terms of level of detail) of the
classification task. The AUC of the best model (NN with PW) amounts to
0.83, while for the worst two models AUC = 0.77 (RF with AC, NN with
AC). While the AUC of all models is lower than 0.7 when observing only
25% or 50% of students’ interactions, interestingly there is a large
gap between the best model (RF with AS) and all the other models (i.e.
at 25%: AUC_{RF,AS} = 0.59, AUC_{RF,AC} = 0.56).


With this last experiment, we assessed the capabilities of 
our models to make predictions as early as possible during 
interaction with the simulation as a basis for intervention. 
By evaluating the models at different percentages of total 
interactions, we took into account the fact that the defini- 
tion of ‘early’ depends on the student. Our results show that 
after observing the first 50% of interactions, we are able to 
reliably distinguish between students with a high and low 
conceptual understanding gained by the end of the learning 
activity (both and closed). At 75% of interactions, the best 
models are also able to provide a more fine-grained predic- 
tion (correct, areasep, and closed). 


5. DISCUSSION 


Over the last decade, interactive simulations of scientific phenomena
have become increasingly popular. They allow students to learn the
principles underlying a domain through their own explorations. However,
only few students possess the degree of inquiry skills and
self-regulation necessary for effective learning in these environments.
In this paper, we therefore explored approaches for an early
identification of struggling students as a basis for adaptive guidance:
we aimed at predicting students’ conceptual knowledge while they
interact with a Physics simulation. Specifically, we were interested in
answering the following three research questions: 1) Can students’
interaction with the data be associated with the gained conceptual
understanding? 2) Can conceptual understanding be inferred through
sequence mining methods with embeddings? 3) Can the proposed methods be
used for early prediction of students’ conceptual understanding based
on partial sequences of interaction data?


To answer the first research question, we analysed data from 
192 first-year undergraduate Physics students who used an 
interactive capacitor simulation to solve a task in which they 
had to rank four capacitor configurations by their stored 
energy. Previous research has emphasised the importance 
of aligning instructional and assessment activities to imple- 
ment pedagogically meaningful learning activities with ed- 
ucational technology [26, 27]. Since one objective of auto- 
matically detecting student learning behavior is to provide 
some kind of assessment (either formative or summative), 
the learning task presented in this work was designed to fa- 
cilitate the application of sequence mining. The design of 
this learning task allowed us to relate all of the students’ answers to
a certain level of conceptual understanding. Using a decision tree, we
were able to map each ranking to one of the four labels representing
groups of similar conceptual understanding. Our results show that all
evaluated models were able to correctly associate students’ sequential
interaction data in the simulation with the generated labels, achieving
high predictive performance when fed with full sequences (2-class case:
AUC > 0.9, 3-class case: AUC > 0.85, 4-class case: AUC > 0.75). This
high predictive power was also observed in [22], where the authors
reached an accuracy of 85% when separating ‘high’ learners from ‘low’
learners based on full interaction sequences. However, despite finding
a potential third cluster, they did not investigate the ternary
classification task. In this paper, we increase the granularity of the
labels to target more specific shortcomings in students’ knowledge and
thus provide them with more detailed feedback. We therefore conclude
that we can answer research question 1) with yes.


The second research question investigates the benefits of latent
features generated by skip-grams for offline classification tasks in
the context of education. Usually applied to NLP problems, skip-grams
have the ability to learn the context of a word in an unsupervised
fashion. In our case, we use them to capture the behaviour surrounding
each student interaction in the OELE, and retrieve the embedding matrix
of the neural network to create our latent representations. This
approach has already proven effective for analysing student strategies
in blended courses [28], but not yet for the identification of
conceptual understanding. To evaluate the predictive power of latent
feature representations, we trained two classifiers (NN and RF) on
three types of features (Action Counts, Action Span and Pairwise
Embeddings) on the full sequences of students. First, we notice that
all model and feature combinations achieve a high AUC for all
classification tasks (2-class case, 3-class case, and 4-class case).
Though the ANOVA revealed no significant differences between the
predictive performance of models with different types of features, we
can observe that the NN with PW achieves a higher performance on
average than all other combinations but the NN with AS. Moreover, its
first-quartile performance exceeds the third-quartile performance of
three other model and feature combinations. Additionally, its
performance variance is smaller than that of the NN with AS. This
suggests that pairwise embeddings generated by a skip-gram approach can
be a valuable asset for finer-grained classification, even if no
statistically significant difference was found with respect to the
other model-feature combinations. We can therefore answer research
question 2) with a partial yes.
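

To illustrate how a skip-gram learns the context of an interaction, the
following minimal sketch generates the (target, context) training pairs
from one interaction sequence. The action names and the window size are
hypothetical placeholders, and the network that is trained on such
pairs, from which the embedding matrix is retrieved, is not shown.

```python
def skipgram_pairs(sequence, window=2):
    """Return all (target, context) pairs within `window` positions."""
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an action is not its own context
                pairs.append((target, sequence[j]))
    return pairs


# Hypothetical action labels for one student (illustration only).
actions = ["plate_area", "voltage", "plate_separation", "other"]
for target, context in skipgram_pairs(actions, window=1):
    print(target, "->", context)
```

Each target action is paired with the actions that occur shortly before
and after it, so actions used in similar contexts end up with similar
embeddings once a model is trained on these pairs.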


To address the third research question, we assessed predic- 
tive performance of the proposed approach when only partial 
sequences of the students’ interaction data were observed. 
We analysed the performances of our proposed approaches 
based on varying proportions of the available data and for 
classification tasks with different levels of complexity. The 
results of our experiments show that the proposed combinations of
models and generated features allowed us to predict the correct class
labels early on. The best models were able to reliably predict
students’ conceptual understanding for the 2-class case (AUC ≈ 0.8)
after having seen 50% of the students’ interaction data. To reach a
similar predictive performance for the more fine-grained 3-class and
4-class cases, the best models needed about 75% of the data. The findings
from these experiments therefore represent a promising step 
towards early prediction of students’ conceptual understand- 
ing in OELEs. We can therefore answer research question 
3) with yes. 


One of the limitations of this work is that we could not track whether
students used external resources (other than the simulation) to rank
the four capacitor configurations. This may bias the inference from
simulation usage to the inferred level of understanding. Furthermore,
due to our small sample size, we were only able to train shallow NN
classifiers and skip-grams. Finally, the external validity of these
experiments remains to be evaluated on other interactive simulations
and different types of tasks.


To conclude, the proposed approach represents a promising step towards
early prediction of students’ learning strategies in interactive
simulations, which can moreover be associated with their level of
conceptual understanding. The proposed
learning activity seems to represent an interesting example 
for the design of learning tasks in OELEs that facilitates the 
association of detected student strategies with conceptual 
understanding through sequence mining. Future work could 
explore whether such designs could also be used to identify 
conceptual understanding at a more fine-grained level. 